Abstract

In this paper, we consider the coherent theory of (epistemic) uncertainty 
of Walley, in which beliefs are represented through sets of probability dis- 
tributions, and we focus on the problem of modeling prior ignorance about 
a categorical random variable. In this setting, it is a known result that a 
state of prior ignorance is not compatible with learning. To overcome this 
problem, another state of beliefs, called near-ignorance, has been proposed. 
Near-ignorance resembles ignorance very closely, by satisfying some princi- 
ples that can arguably be regarded as necessary in a state of ignorance, and 
allows learning to take place. This paper provides new and substantial
evidence that near-ignorance, too, cannot really be regarded as a way out of
the problem of starting statistical inference in conditions of very weak
beliefs. The key to this result is focusing on a setting characterized by a
variable of interest that is latent. We argue that such a setting is by far
the most common case in practice, and we provide, for the case of categorical
latent variables (and general manifest variables), a condition that, if
satisfied, prevents learning from taking place under prior near-ignorance.
This condition is shown to be easily satisfied even in the most common
statistical problems. We regard these results as a strong form of evidence
against the possibility of adopting a condition of prior near-ignorance in
real statistical problems.
1 Introduction



Epistemic theories of statistics are often confronted with the question of prior igno- 
rance. Prior ignorance means that a subject, who is about to perform a statistical 
analysis, is missing substantial beliefs about the underlying data-generating process. 
Yet, the subject would like to exploit the available sample to draw some statistical 
conclusion, i.e., the subject would like to use the data to learn, moving away from 
the initial condition of ignorance. This situation is very important as it is often 
desirable to start a statistical analysis with weak assumptions about the problem of 
interest, thus trying to implement an objective-minded approach to statistics. 

A fundamental question is whether prior ignorance is compatible with learning 
or not. Walley gives a negative answer for the case of his self-consistent (or coherent) 
theory of statistics based on the modeling of beliefs through sets of probability dis- 
tributions. He shows, in a very general sense, that vacuous prior beliefs, i.e., beliefs 
that a priori are maximally imprecise, lead to vacuous posterior beliefs, irrespective 
of the type and amount of observed data [Wal91, Section 7.3.7]. At the same time,
he proposes focusing on a slightly different state of beliefs, called near-ignorance,
that does enable learning to take place [Wal91, Section 4.6.9]. Loosely speaking,
near-ignorant beliefs are beliefs that are vacuous for a proper subset of the
functions of the random variables under consideration (see Section 3). In this way, a
near-ignorance prior still gives one the possibility to express vacuous beliefs for some
functions of interest, and at the same time it maintains the possibility to learn from
data. The fact that learning is possible under prior near-ignorance is shown, for
instance, in the special case of the imprecise Dirichlet model (IDM) [Wal96, Ber05].
This is a popular model, based on a near-ignorance set of priors, used in the case of 
inference from categorical data generated by a multinomial process. 

Our aim in this paper is to investigate whether near-ignorance can really be
regarded as a possible way out of the problem of starting statistical inference in
conditions of very weak beliefs. We carry out this investigation in a setting made
of categorical data generated by a multinomial process, as in the IDM, but we
consider near-ignorance sets of priors in general, not only the one used in the IDM.

The interest in this investigation is motivated by the fact that near-ignorance 
sets of priors appear to play a crucially important role in the question of modeling 
prior ignorance about a categorical random variable. The key point is that near- 
ignorance sets of priors can be made to satisfy two principles: the symmetry and the 
embedding principles. The first is well known and is equivalent to Laplace's indiffer- 
ence principle; the second states, loosely speaking, that if we are ignorant a priori, 
our prior beliefs on an event of interest should not depend on the space of
possibilities in which the event is embedded (see Section 3 for a discussion about these
two principles). Walley [Wal91], and later de Cooman and Miranda [DCM06], have
argued extensively on the necessity of both the symmetry and the embedding
principles in order to characterize a condition of ignorance about a categorical random
variable. This implies, if we agree that the symmetry and the embedding principles
are necessary for ignorance, that near-ignorance sets of priors should be regarded
as an especially important avenue for a subject who wishes to learn starting in a 
condition of ignorance. 

Our investigation starts by focusing on a setting where the categorical variable X 
under consideration is latent. This means that we cannot observe the realizations of 
X, so that we can learn about it only by means of another, not necessarily categor- 
ical, variable S, related to X through a known conditional probability distribution 
P(S | X). Variable S is assumed to be manifest, in the sense that its realizations can 
be observed (see Section 2). The intuition behind the setup considered, made of
X and S, is that in many real cases it is not possible to directly observe the value 
of a random variable in which we are interested, for instance when this variable 
represents a patient's health and we are observing the result of a diagnostic test. In 
these cases, we need to use a manifest variable (the medical test) in order to obtain 
information about the original latent variable (the patient's health). In this paper, 
we regard the passage from the latent to the manifest variable as made by a process 
that we call the observational process.^1

Using the introduced setup, we give a condition in Section 4, related to the
likelihood function, that is shown to be sufficient to prevent learning about X under 
prior near-ignorance. The condition is very general as it is developed for any set of 
priors that models near-ignorance (thus including the case of the IDM), and for very 
general kinds of probabilistic relations between X and S. We show then, by simple 
examples, that such a condition is easily satisfied, even in the most elementary and 
common statistical problems. 

In order to fully appreciate this result, it is important to realize that latent 
variables are ubiquitous in problems of uncertainty. The key point here is that the 
scope of observational processes greatly extends if we consider that even when we 
directly obtain the value of a variable of interest, what we actually obtain is the 
observation of the value rather than the value itself. Making this distinction makes
sense because in practice an observational process is usually imperfect, i.e., there
is very often (it could be argued that there is always) a positive probability of
confounding the realized value of X with another possible value, thus committing an
observation error.

Of course, if the probability of an observation error is very small and we consider
one of the common Bayesian models proposed to learn under prior ignorance, then
there is little difference between the results provided by a latent variable model 
modeling correctly the observational process, and the results provided by a model 
where the observations are assumed to be perfect. For this reason, the observational 
process is often neglected in practice and the distinction between the latent variable 
and the manifest one is not enforced. 

But, on the other hand, if we consider sets of probability distributions to model
our prior beliefs, instead of a single probability distribution, and in particular if
we consider near-ignorance sets of priors, then there can be an extreme difference
between a latent variable model and a model where the observations are considered
to be perfect, so that learning may be impossible in the first model and possible in
the second. As a consequence, when dealing with sets of probability distributions,
neglecting the observational process may no longer be justified even if the
probability of observation error is tiny. This is shown in a definite sense in an example
of Section 4.3, where we analyze the relevance of our results for the special case
of the IDM. From the proofs in this paper, it follows that this kind of behavior is
mainly determined by the presence, in the near-ignorance set of priors, of extreme,
almost-deterministic, distributions. The problem is that these distributions,
which are usually not considered when dealing with Bayesian models
with a single prior, cannot be ruled out without dropping near-ignorance.

1 Elsewhere, this is also called the measurement process.

These considerations highlight the quite general applicability of the present
results and hence raise serious doubts about the possibility of adopting a condition of
prior near-ignorance in real, as opposed to idealized, applications of statistics. As a
consequence, it may make sense to consider re-focusing the research about this sub- 
ject on developing models of very weak states of belief that are, however, stronger 
than near-ignorance. This might also involve dropping the idea that both the sym- 
metry and the embedding principles can be realistically met in practice. 

2 Categorical Latent Variables 

In this paper, we follow the general definition of latent and manifest variables given 
by Skrondal and Rabe-Hesketh [SRH04]: a latent variable is a random variable
whose realizations are unobservable (hidden), while a manifest variable is a random 
variable whose realizations can be directly observed. 

The concept of latent variable is central in many sciences, such as
psychology and medicine. Skrondal and Rabe-Hesketh list several fields of application
and several phenomena that can be modelled using latent variables, and conclude 
that latent variable modeling "pervades modern mainstream statistics" although 
"this omni-presence of latent variables is commonly not recognized, perhaps because 
latent variables are given different names in different literatures, such as random 
effects, common factors and latent classes" or hidden variables. 

But what are latent variables in practice? According to Borsboom et al.
[BMvH02], there may be different interpretations of latent variables. A latent
variable can be regarded, for example, as an unobservable random variable that exists
independently of the observation. An example is the unobservable health status of
a patient that is subject to a medical test. Another possibility is to regard a latent
variable as a product of the human mind, a construct that does not exist
independently of the observation. An example is the unobservable state of the economy, often
used in economic models. In this paper, we assume the existence of a latent
categorical random variable $X$, with outcomes in $\mathcal{X} = \{x_1, \ldots, x_k\}$ and unknown chances
$\vartheta \in \Theta := \{\vartheta = (\vartheta_1, \ldots, \vartheta_k) \mid \sum_{i=1}^{k} \vartheta_i = 1,\ 0 \le \vartheta_i \le 1\}$, without stressing any
particular interpretation. Throughout the paper, we denote by $\vartheta$ a particular vector of
chances in $\Theta$ and by $\theta$ a random variable on $\Theta$.

Now, let us focus on a bounded real-valued function $f$ defined on $\Theta$, where
$\vartheta \in \Theta$ are the unknown chances of $X$. We aim at learning the value $f(\vartheta)$ using
$n$ realizations of the variable $X$. Because the variable $X$ is latent and therefore
unobservable by definition, the only way to learn $f(\vartheta)$ is to observe the realizations
of some manifest variable $S$ related, through known probabilities $P(S \mid X)$, to the
(unobservable) realizations of $X$. An example of known probabilistic relationship
between latent and manifest variables is the following.

Example 1 Consider a binary medical diagnostic test used to assess the health
status of a patient with respect to a given disease. The accuracy of a diagnostic
test^2 is determined by two probabilities: the sensitivity of a test is the probability of
obtaining a positive result if the patient is diseased; the specificity is the probability
of obtaining a negative result if the patient is healthy. Medical tests are assumed to
be imperfect indicators of the unobservable true disease status of the patient.
Therefore, we assume that the probability of obtaining a positive result when the patient
is healthy and the probability of obtaining a negative result when the patient is
diseased are both non-zero. Suppose, to make things simpler, that the sensitivity and the specificity
of the test are known. In this example, the unobservable health status of the patient
can be considered as a binary latent variable X with values in the set {Healthy, Ill},
while the result of the test can be considered as a binary manifest variable S with
values in the set {Negative result, Positive result}. Because the sensitivity and the
specificity of the test are known, we know P(S | X). ◊
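
The known relation P(S | X) of Example 1 can be written down explicitly. The following Python sketch is only illustrative: the numerical sensitivity and specificity (0.9 and 0.95), the prevalence (0.2), and the names `emission_table` and `simulate_test` are assumptions made for the illustration and do not come from the paper. It also mimics the two-step process in which the latent status is drawn first and only the test result is returned.

```python
import random

def emission_table(sensitivity, specificity):
    """Return P(S | X) as nested dicts: table[x][s] = P(S = s | X = x)."""
    return {
        "Ill": {"Positive": sensitivity, "Negative": 1.0 - sensitivity},
        "Healthy": {"Positive": 1.0 - specificity, "Negative": specificity},
    }

def simulate_test(prevalence, table, n, seed=0):
    """Draw the hidden status X first, then the observed result S given X."""
    rng = random.Random(seed)
    results = []
    for _ in range(n):
        # Latent step: the health status is drawn but never returned.
        x = "Ill" if rng.random() < prevalence else "Healthy"
        # Manifest step: the test result depends on X only.
        s = "Positive" if rng.random() < table[x]["Positive"] else "Negative"
        results.append(s)
    return results

# Assumed illustrative values, not taken from the paper.
table = emission_table(sensitivity=0.9, specificity=0.95)
sample = simulate_test(prevalence=0.2, table=table, n=1000)
print(sample.count("Positive") / len(sample))
```

Only `sample` would be available to the analyst; the simulated values of X are discarded, which is what makes X latent.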

We continue the discussion of this example later, in the light of our results,
in Example 4 of Section 4.1.

3 Near-ignorance sets of priors 

Consider a categorical random variable $X$ with outcomes in $\mathcal{X} = \{x_1, \ldots, x_k\}$ and
unknown chances $\vartheta \in \Theta$. Suppose that we have no relevant prior information about
$\vartheta$ and we are therefore in a situation of prior ignorance about $X$. How should we
model our prior beliefs in order to reflect the initial lack of knowledge?

Let us give a brief overview of this topic in the case of coherent models of un- 
certainty, such as Bayesian probability theory and Walley's theory of coherent lower 
previsions. 

In the traditional Bayesian setting, prior beliefs are modelled using a single prior 
probability distribution. The problem of defining a standard prior probability
distribution modeling a situation of prior ignorance, a so-called non-informative prior,
has been an important research topic in the last two centuries^3 and, despite the
numerous contributions, it remains an open research issue, as illustrated by Kass
and Wasserman [KW96]. See also Hutter [Hut06] for recent developments and
complementary considerations. There are many principles and properties that are
desirable when the focus is on modeling a situation of prior ignorance, and that
have indeed been used in past research to define non-informative priors. For
example, Laplace's symmetry or indifference principle has suggested, in case of finite
possibility spaces, the use of the uniform distribution. Other principles, such as
the principle of invariance under group transformations, the maximum
entropy principle, the conjugate priors principle, etc., have suggested the use of other
non-informative priors, in particular for continuous possibility spaces, satisfying one
or more of these principles. But, in general, it has proven to be difficult to
define a standard non-informative prior satisfying, at the same time, all the desirable
principles.

2 For further details about the modeling of diagnostic accuracy with latent variables see Yang
and Becker [YB97].

We follow Walley [Wal96] and de Cooman and Miranda [DCM06] when they say
that there are at least two principles that should be satisfied to model a situation
of prior ignorance: the symmetry and the embedding principles. The symmetry
principle states that, if we are ignorant a priori about $\vartheta$, then we have no reason to
favour one possible outcome of $X$ over another, and therefore our probability model
on $X$ should be symmetric. This principle is equivalent to Laplace's symmetry or
indifference principle. The embedding principle states that, for each possible event
$A$, the probability assigned to $A$ should not depend on the possibility space $\mathcal{X}$ in
which $A$ is embedded. In particular, the probability assigned a priori to the event
$A$ should be invariant with respect to refinements and coarsenings of $\mathcal{X}$.

It is easy to show that the embedding principle is not satisfied by the uniform 
distribution. How should we model our prior ignorance in order to satisfy these two 
principles? Walley^4 gives what we believe to be a compelling answer to this question:
he proves that the only coherent probability model on $X$ consistent with the two
principles is the vacuous probability model, i.e., the model that assigns, for each
non-trivial event $A$, lower probability $\underline{P}(A) = 0$ and upper probability $\overline{P}(A) = 1$.
Clearly, the vacuous probability model cannot be expressed using a single probability
distribution. It follows then, if we agree that the symmetry and the embedding
principles are characteristics of prior ignorance, that we need imprecise probabilities
to model such a state of beliefs.^5 Unfortunately, it is easy to show that updating
the vacuous probability model on $X$ produces only vacuous posterior probabilities.
Therefore, the vacuous probability model alone is not a viable way to address our
initial problem. Walley suggests, as an alternative, the use of near-ignorance sets of
priors.^6

3 Starting from the work of Laplace at the beginning of the 19th century [Lap51].
4 In Walley [Wal91], Note 7 at p. 526. See also Section 5.5 of the same book.
5 For a complementary point of view, see Hutter [Hut06].
6 Walley calls a set of probability distributions modeling near-ignorance a near-ignorance prior.
In this paper we use the term near-ignorance set of priors in order to avoid confusion with the
precise Bayesian case.

A near-ignorance set of priors is a probability model on the chances $\theta$ of $X$,
modeling a very weak state of knowledge about $\theta$. In practice, a near-ignorance set
of priors is a large closed convex set $\mathcal{M}_0$ of prior probability densities on $\theta$ which
produces vacuous expectations for various but not all functions $f$ on $\Theta$, i.e., such
that $\underline{E}(f) = \inf_{\vartheta \in \Theta} f(\vartheta)$ and $\overline{E}(f) = \sup_{\vartheta \in \Theta} f(\vartheta)$.

The key point here is that near-ignorance sets of priors can be designed so as to
satisfy both the symmetry and the embedding principles. In fact, if a near-ignorance
set of priors produces vacuous expectations for all the functions $f(\vartheta) = \vartheta_i$ for each
$i \in \{1, \ldots, k\}$, then, because a priori $P(X = x_i) = E(\theta_i)$, the near-ignorance set
of priors implies the vacuous probability model on $X$ and satisfies therefore both
the symmetry and the embedding principle, thus delivering a satisfactory model of
prior near-ignorance.^7 Updating a near-ignorance prior consists in updating all the
probability densities in $\mathcal{M}_0$ using Bayes' rule. Since the beliefs on $\theta$ are not vacuous,
this makes it possible to calculate non-vacuous posterior probabilities for $X$.

A good example of near-ignorance set of priors is the set $\mathcal{M}_0$ used in the
imprecise Dirichlet model. The IDM models a situation of prior near-ignorance about
a categorical random variable $X$. The near-ignorance set of priors $\mathcal{M}_0$ used in the
IDM consists of the set of all Dirichlet densities^8 $p(\vartheta) = dir_{s,t}(\vartheta)$ for a fixed $s > 0$
and all $t \in \mathcal{T}$, where

$$dir_{s,t}(\vartheta) := \frac{\Gamma(s)}{\prod_{i=1}^{k} \Gamma(s t_i)} \prod_{i=1}^{k} \vartheta_i^{s t_i - 1}, \tag{1}$$

and

$$\mathcal{T} := \Big\{ t = (t_1, \ldots, t_k) \ \Big|\ \sum_{j=1}^{k} t_j = 1,\ 0 < t_j < 1 \Big\}. \tag{2}$$

The particular choice of $\mathcal{M}_0$ in the IDM implies vacuous prior expectations for all
the functions $f(\vartheta) = \vartheta_i^R$, for all integers $R \ge 1$ and all $i \in \{1, \ldots, k\}$, i.e., $\underline{E}(\theta_i^R) = 0$
and $\overline{E}(\theta_i^R) = 1$. Choosing $R = 1$, we have, a priori,

$$\underline{P}(X = x_i) = \underline{E}(\theta_i) = 0, \qquad \overline{P}(X = x_i) = \overline{E}(\theta_i) = 1.$$

It follows that the particular near-ignorance set of priors $\mathcal{M}_0$ used in the IDM
implies a priori the vacuous probability model on $X$ and, therefore, satisfies both
the symmetry and embedding principles. On the other hand, the particular set
of priors used in the IDM does not imply vacuous prior expectations for all the
functions $f(\vartheta)$. For example, vacuous expectations for the functions $f(\vartheta) = \vartheta_i \cdot \vartheta_j$
for $i \ne j$ would be $\underline{E}(\theta_i \cdot \theta_j) = 0$ and $\overline{E}(\theta_i \cdot \theta_j) = 0.25$, but in the IDM we have
a priori $\overline{E}(\theta_i \cdot \theta_j) < 0.25$ and the prior expectations are therefore not vacuous.
In Walley [Wal96], it is shown that the IDM produces, for each observed dataset,
non-vacuous posterior probabilities for $X$.

7 We call this state near-ignorance because, although we are completely ignorant a priori about
$X$, we are not completely ignorant about $\theta$ [Wal91, Section 5.3, Note 4].
8 Throughout the paper, if no confusion is possible, we denote the outcome $\theta = \vartheta$ by $\vartheta$. For
example, we denote $p(\theta = \vartheta)$ by $p(\vartheta)$.
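
The prior expectations discussed above can be made concrete with the standard moments of the Dirichlet density with parameters $s t_1, \ldots, s t_k$, namely $E(\theta_i) = t_i$ and $E(\theta_i \theta_j) = s\, t_i t_j / (s + 1)$ for $i \ne j$. The Python sketch below is illustrative only: the value $s = 2$, the counts, and the function names are assumptions. The posterior bounds in `idm_posterior_bounds` are the well-known IDM interval for directly observed categorical data, quoted here as background rather than derived in this section.

```python
def prior_mean(t, i):
    """E[theta_i] under a Dirichlet density with parameters s * t (independent of s)."""
    return t[i]

def prior_mean_product(s, t, i, j):
    """E[theta_i * theta_j], i != j, under a Dirichlet density with parameters s * t."""
    return s * t[i] * t[j] / (s + 1.0)

def idm_posterior_bounds(counts, s, i):
    """Lower and upper posterior probability of category i in the IDM,
    assuming the categorical variable is observed without error."""
    n = sum(counts)
    return counts[i] / (n + s), (counts[i] + s) / (n + s)

s = 2.0  # assumed value of the IDM hyperparameter

# As t_1 approaches 0 or 1, E[theta_1] approaches 0 or 1: vacuous bounds.
for eps in (1e-1, 1e-3, 1e-6):
    print(eps, prior_mean((eps, 1.0 - eps), 0), prior_mean((1.0 - eps, eps), 0))

# For k = 2 the product t_1 * t_2 is largest at t = (1/2, 1/2), so the upper
# expectation of theta_1 * theta_2 is s / (4 (s + 1)), strictly below 0.25.
print(prior_mean_product(s, (0.5, 0.5), 0, 1))

# Non-vacuous posterior interval for assumed counts (3, 7).
print(idm_posterior_bounds((3, 7), s, 0))
```

The first loop shows why the prior model on $X$ is vacuous, while the product expectation shows a function for which the same set of priors is informative.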

4 Limits of Learning under Prior Near-Ignorance 

Consider a sequence of independent and identically distributed (IID) categorical
latent variables $(X_i)_{i \in \mathbb{N}}$ with outcomes in $\mathcal{X}$ and unknown chances $\theta$, and a sequence
of independent manifest variables $(S_i)_{i \in \mathbb{N}}$, which we allow to be defined either on
finite or infinite spaces. We assume that a realization of the manifest variable $S_i$ can
be observed only after a (hidden) realization of the latent variable $X_i$. Furthermore,
we assume $S_i$ to be independent of the chances $\theta$ of $X_i$ conditional on $X_i$, i.e.,

$$P(S_i \mid X_i = x_j, \theta = \vartheta) = P(S_i \mid X_i = x_j), \tag{3}$$

for each $x_j \in \mathcal{X}$ and $\vartheta \in \Theta$.^9 These assumptions model a two-step process where
the variable $S_i$ is used to convey information about the realized value of $X_i$ for each
$i$, independently of the chances of $X_i$. The (in)dependence structure can be depicted
graphically as follows,

[Diagram: $\theta \rightarrow X_i \rightarrow S_i$, with the part $X_i \rightarrow S_i$ framed]

where the framed part of this structure is what we call an observational process.

To make things simpler, we assume the probability distribution $P(S_i \mid X_i = x_j)$
to be precise and known for each $x_j \in \mathcal{X}$ and each $i \in \mathbb{N}$.

We divide the discussion about the limits of learning under prior near-ignorance
into three subsections. In Section 4.1 we discuss our general parametric problem
and we obtain a condition that, if satisfied, prevents learning from taking place. In
Section 4.2 we study the consequences of our theoretical results in the particular
case of predictive probabilities. Finally, in Section 4.3 we focus on the particular
near-ignorance set of priors used in the IDM and we obtain necessary and sufficient
conditions for learning with categorical manifest variables.

4.1 General parametric inference 

We focus on a very general problem of parametric inference. Suppose that we
observe a dataset $s$ of realizations of the manifest variables $S_1, \ldots, S_n$ related to the
(unobservable) dataset $x \in \mathcal{X}^n$ of realizations of the variables $X_1, \ldots, X_n$. Defining
the random variables $\mathbf{X} := (X_1, \ldots, X_n)$ and $\mathbf{S} := (S_1, \ldots, S_n)$, we have $\mathbf{S} = s$
and $\mathbf{X} = x$. To simplify notation, when no confusion can arise, we denote in
the rest of the paper $\mathbf{S} = s$ with $s$. Given a bounded function $f(\vartheta)$, our aim is
to calculate $\underline{E}(f \mid s)$ and $\overline{E}(f \mid s)$ starting from a condition of ignorance about $f$,
i.e., using a near-ignorance prior $\mathcal{M}_0$, such that $\underline{E}(f) = f_{\min} := \inf_{\vartheta \in \Theta} f(\vartheta)$ and
$\overline{E}(f) = f_{\max} := \sup_{\vartheta \in \Theta} f(\vartheta)$.

9 We usually denote by $P$ a probability (discrete case) and by $p$ a probability density
(continuous case). If an expression holds in both the discrete and the continuous case, like for example
Equation (3), then we use $P$ to indicate both cases.

Is it really possible to learn something about the function /, starting from a 
condition of prior near-ignorance and having observed a dataset s? The following 
theorem shows that, very often, this is not the case. In particular, Corollary 3 shows
that there is a condition that, if satisfied, prevents learning from taking place.

Theorem 2 Let $s$ be given. Consider a bounded continuous function $f$ defined on
$\Theta$ and a near-ignorance set of priors $\mathcal{M}_0$. Then the following statements hold.^10

1. If the likelihood function $P(s \mid \vartheta)$ is strictly positive^11 in each point in which $f$
reaches its maximum value $f_{\max}$, is continuous in an arbitrarily small
neighborhood of those points, and $\mathcal{M}_0$ is such that a priori $\overline{E}(f) = f_{\max}$, then

$$\overline{E}(f \mid s) = \overline{E}(f) = f_{\max}.$$

2. If the likelihood function $P(s \mid \vartheta)$ is strictly positive in each point in which $f$
reaches its minimum value $f_{\min}$, is continuous in an arbitrarily small
neighborhood of those points, and $\mathcal{M}_0$ is such that a priori $\underline{E}(f) = f_{\min}$, then

$$\underline{E}(f \mid s) = \underline{E}(f) = f_{\min}.$$

Corollary 3 Consider a near-ignorance set of priors $\mathcal{M}_0$. Let $s$ be given and let
$P(s \mid \vartheta)$ be a continuous strictly positive function on $\Theta$. If $\mathcal{M}_0$ is such that $\underline{E}(f) = f_{\min}$
and $\overline{E}(f) = f_{\max}$, then

$$\underline{E}(f \mid s) = \underline{E}(f) = f_{\min},$$

$$\overline{E}(f \mid s) = \overline{E}(f) = f_{\max}.$$

In other words, given $s$, if the likelihood function is strictly positive, then the
functions $f$ that, according to $\mathcal{M}_0$, have vacuous expectations a priori, have vacuous
expectations also a posteriori, after having observed $s$. It follows that, if this
sufficient condition is satisfied, we cannot use near-ignorance priors to model a state
of prior ignorance because only vacuous posterior expectations are produced. The
sufficient condition described above is met very easily in practice, as shown in the
following two examples. In the first example, we consider a very simple setting
where the manifest variables are categorical. In the second example, we consider
a simple setting with continuous manifest variables. We show that, in both cases,
the sufficient condition is satisfied and therefore we are unable to learn under prior
near-ignorance.

10 The proof of this theorem is given in the appendix, together with all the other proofs of the
paper.
11 In the appendix it is shown that the assumptions of positivity of $P(s \mid \vartheta)$ in Theorem 2 can
be substituted by the following weaker assumptions. For a given arbitrarily small $\delta > 0$,
denote by $\Theta_\delta$ the measurable set $\Theta_\delta := \{\vartheta \in \Theta \mid f(\vartheta) > f_{\max} - \delta\}$. If $P(s \mid \vartheta)$ is such that
$\lim_{\delta \to 0} \inf_{\vartheta \in \Theta_\delta} P(s \mid \vartheta) = c > 0$, then Statement 1 of Theorem 2 holds. The same holds for
the second statement, substituting $\Theta_\delta$ with the set $\{\vartheta \in \Theta \mid f(\vartheta) < f_{\min} + \delta\}$.

Example 4 Consider the medical test introduced in Example 1 and an (ideally)
infinite population of individuals. Denote by the binary variable $X_i \in \{H, I\}$ the
health status of the $i$-th individual of the population and by $S_i \in \{+, -\}$ the
result of the diagnostic test applied to the same individual. We assume that the
variables in the sequence $(X_i)_{i \in \mathbb{N}}$ are IID with unknown chances $(\vartheta, 1 - \vartheta)$, where $\vartheta$
corresponds to the (unknown) proportion of diseased individuals in the population.
Denote by $1 - \varepsilon_1$ the specificity and by $1 - \varepsilon_2$ the sensitivity of the test. Then it
holds that

$$P(S_i = + \mid X_i = H) = \varepsilon_1 > 0, \qquad P(S_i = - \mid X_i = I) = \varepsilon_2 > 0,$$

where $(I, H, +, -)$ denote (patient ill, patient healthy, test positive, test negative).

Suppose that we observe the results of the test applied to $n$ different individuals
of the population; using our previous notation we have $\mathbf{S} = s$. For each individual
we have,

$$P(S_i = + \mid \vartheta) = P(S_i = + \mid X_i = I) P(X_i = I \mid \vartheta) + P(S_i = + \mid X_i = H) P(X_i = H \mid \vartheta)$$
$$= (1 - \varepsilon_2) \cdot \vartheta + \varepsilon_1 \cdot (1 - \vartheta) > 0.$$

Analogously,

$$P(S_i = - \mid \vartheta) = P(S_i = - \mid X_i = I) P(X_i = I \mid \vartheta) + P(S_i = - \mid X_i = H) P(X_i = H \mid \vartheta)$$
$$= \varepsilon_2 \cdot \vartheta + (1 - \varepsilon_1) \cdot (1 - \vartheta) > 0.$$

Denote by $n^s$ the number of positive tests in the observed sample $s$. Since the
variables $S_i$ are independent, we have

$$P(\mathbf{S} = s \mid \vartheta) = \big((1 - \varepsilon_2) \cdot \vartheta + \varepsilon_1 \cdot (1 - \vartheta)\big)^{n^s} \cdot \big(\varepsilon_2 \cdot \vartheta + (1 - \varepsilon_1) \cdot (1 - \vartheta)\big)^{n - n^s} > 0$$

for each $\vartheta \in [0, 1]$ and each $s \in \{+, -\}^n$. Therefore, according to Corollary 3, all the
functions $f$ that, according to $\mathcal{M}_0$, have vacuous expectations a priori have vacuous
expectations also a posteriori. It follows that, if we want to avoid vacuous posterior
expectations, then we cannot model our prior knowledge (ignorance) using a
near-ignorance set of priors. This simple example shows that our previous theoretical
results raise serious questions about the use of near-ignorance sets of priors also in
very simple, common, and important situations. ◊
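
The effect described in Example 4 can be checked numerically. The sketch below assumes, purely for illustration, that the set of priors is the IDM set for a binary variable, i.e., all Beta densities with parameters $(\alpha, \beta)$ such that $\alpha + \beta = s$; the sample (3 positive tests out of 10), the error rates, and the function names are likewise assumptions. Expanding the likelihood with the binomial theorem turns each posterior into a finite mixture of Beta densities, so the posterior mean of $\vartheta$ is computed exactly instead of by numerical integration (Python 3.8 or later is assumed for `math.comb`).

```python
import math

def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def posterior_mean(alpha, beta, n_pos, n_neg, eps1, eps2):
    """Posterior mean of theta under a Beta(alpha, beta) prior when
    P(+ | theta) = (1 - eps2) * theta + eps1 * (1 - theta) and
    P(- | theta) = eps2 * theta + (1 - eps1) * (1 - theta)."""
    n = n_pos + n_neg
    num = den = 0.0
    for i in range(n_pos + 1):          # power of theta from positive tests
        for j in range(n_neg + 1):      # power of theta from negative tests
            coef = (math.comb(n_pos, i) * (1 - eps2) ** i * eps1 ** (n_pos - i)
                    * math.comb(n_neg, j) * eps2 ** j * (1 - eps1) ** (n_neg - j))
            if coef == 0.0:
                continue
            k, l = i + j, n - (i + j)   # term is theta**k * (1 - theta)**l
            w = coef * math.exp(log_beta(alpha + k, beta + l) - log_beta(alpha, beta))
            den += w
            num += w * (alpha + k) / (alpha + beta + n)
    return num / den

s, tiny = 2.0, 1e-12   # assumed hyperparameter and an "almost deterministic" prior
for eps in (0.1, 0.0):
    low = posterior_mean(tiny, s - tiny, 3, 7, eps, eps)
    high = posterior_mean(s - tiny, tiny, 3, 7, eps, eps)
    print(f"eps={eps}: posterior mean between {low:.6f} and {high:.6f}")
```

With a positive error rate the two extreme priors give posterior means close to 0 and close to 1, in line with Corollary 3; with error rate zero the same priors give approximately $n^s/(n+s) = 0.25$ and $(n^s + s)/(n + s) \approx 0.4167$, so learning takes place only in the idealized case of perfect observations.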

Example 4 focuses on categorical latent and manifest variables. In the next
example, we show that our theoretical results have important implications also in 
models with categorical latent variables and continuous manifest variables. 

Example 5 Consider a sequence of IID categorical variables $(X_i)_{i \in \mathbb{N}}$ with outcomes
in $\mathcal{X}$ and unknown chances $\theta \in \Theta$. Suppose that, for each $i \ge 1$, after a realization
of the latent variable $X_i$, we can observe a realization of a continuous manifest
variable $S_i$. Assume that $p(S_i \mid X_i = x_j)$ is a continuous positive probability density,
e.g., a normal $N(\mu_j, \sigma_j^2)$ density, for each $x_j \in \mathcal{X}$. We have

$$p(S_i \mid \vartheta) = \sum_{j=1}^{k} p(S_i \mid X_i = x_j) \cdot P(X_i = x_j \mid \vartheta) = \sum_{j=1}^{k} p(S_i \mid X_i = x_j) \cdot \vartheta_j > 0,$$

because $\vartheta_j$ is positive for at least one $j \in \{1, \ldots, k\}$ and we have assumed $S_i$ to be
independent of $\theta$ given $X_i$. Because we have assumed $(S_i)_{i \in \mathbb{N}}$ to be a sequence of
independent variables, we have

$$p(\mathbf{S} = s \mid \vartheta) = \prod_{i=1}^{n} p(S_i = s_i \mid \vartheta) > 0.$$

Therefore, according to Corollary 3, if we model our prior knowledge using a
near-ignorance set of priors $\mathcal{M}_0$, the vacuous prior expectations implied by $\mathcal{M}_0$ remain
vacuous a posteriori. It follows that, if we want to avoid vacuous posterior
expectations, we cannot model our prior knowledge using a near-ignorance set of priors. ◊
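
The positivity argument of Example 5 can be illustrated with a small computation. The Python sketch below evaluates the mixture likelihood for normal emission densities at the vertices of the simplex (the almost-deterministic chances that drive the result) and at one interior point. The means, standard deviations, observations, and function names are assumptions chosen for the illustration.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def likelihood(sample, theta, mus, sigmas):
    """p(S = s | theta) = prod_i sum_j p(s_i | X_i = x_j) * theta_j."""
    total = 1.0
    for s_i in sample:
        total *= sum(normal_pdf(s_i, m, sd) * th
                     for m, sd, th in zip(mus, sigmas, theta))
    return total

# Assumed emission parameters for k = 3 latent outcomes and assumed observations.
mus, sigmas = (-1.0, 0.0, 2.0), (1.0, 0.5, 1.5)
sample = (0.3, -0.7, 1.9, 0.1)

vertices = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
for theta in vertices + [(0.2, 0.5, 0.3)]:
    print(theta, likelihood(sample, theta, mus, sigmas))
```

The likelihood is strictly positive at every point tried, including the vertices. Mathematically this holds for any real observations; in floating-point arithmetic, observations very far from all means could underflow to zero, which is a numerical artifact and not a counterexample.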

Examples 4 and 5 raise, in general, serious concerns about the use of
near-ignorance sets of priors in real applications.

4.2 An important special case: predictive probabilities 

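We focus now on a very important special case: that of predictive inference.^12
Suppose that our aim is to predict the outcomes of the next $n'$ variables $X_{n+1}, \ldots, X_{n+n'}$.
Let $\mathbf{X}' := (X_{n+1}, \ldots, X_{n+n'})$. If no confusion is possible, we denote $\mathbf{X}' = x'$ by $x'$.
Given $x' \in \mathcal{X}^{n'}$, our aim is to calculate $\underline{P}(x' \mid s)$ and $\overline{P}(x' \mid s)$. Modeling our prior
ignorance about the parameters $\theta$ with a near-ignorance set of priors $\mathcal{M}_0$ and denoting
by $\mathbf{n}' := (n'_1, \ldots, n'_k)$ the frequencies of the dataset $x'$, we have

$$\underline{P}(x' \mid s) = \inf_{p \in \mathcal{M}_0} P_p(x' \mid s) = \inf_{p \in \mathcal{M}_0} \int_\Theta \prod_{i=1}^{k} \vartheta_i^{n'_i}\, p(\vartheta \mid s)\, d\vartheta = \inf_{p \in \mathcal{M}_0} E_p\Big(\prod_{i=1}^{k} \theta_i^{n'_i} \,\Big|\, s\Big) = \underline{E}\Big(\prod_{i=1}^{k} \theta_i^{n'_i} \,\Big|\, s\Big),$$

where, according to Bayes' rule,

$$p(\vartheta \mid s) = \frac{P(s \mid \vartheta)\, p(\vartheta)}{\int_\Theta P(s \mid \vartheta)\, p(\vartheta)\, d\vartheta},$$

provided that $\int_\Theta P(s \mid \vartheta)\, p(\vartheta)\, d\vartheta \ne 0$. Analogously, substituting sup for inf in
the expression for $\underline{P}(x' \mid s)$, we obtain

$$\overline{P}(x' \mid s) = \overline{E}\Big(\prod_{i=1}^{k} \theta_i^{n'_i} \,\Big|\, s\Big). \tag{4}$$

Therefore, the lower and upper probabilities assigned to the dataset $x'$ a priori (a
posteriori) correspond to the prior (posterior) lower and upper expectations of the
continuous bounded function $f(\vartheta) = \prod_{i=1}^{k} \vartheta_i^{n'_i}$.

12 For a general presentation of predictive inference see Geisser [Gei93]; for a discussion of the
imprecise probability approach to predictive inference see Walley and Bernard [WB99].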

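It is easy to show that, in this case, the minimum of $f$ is $0$ and is reached in
all the points $\vartheta \in \Theta$ with $\vartheta_i = 0$ for some $i$ such that $n'_i > 0$, while the maximum
of $f$ is reached in a single point of $\Theta$ corresponding to the relative frequencies $\mathbf{f}'$
of the sample $x'$, i.e., at $\mathbf{f}' = \big(\frac{n'_1}{n'}, \ldots, \frac{n'_k}{n'}\big) \in \Theta$, and the maximum of $f$ is given
by $\prod_{i=1}^{k} \big(\frac{n'_i}{n'}\big)^{n'_i}$. It follows that the maximally imprecise probabilities regarding the
dataset $x'$, given that $x'$ has been generated by a multinomial process, are given by

$$\underline{P}(x') = \underline{E}\Big(\prod_{i=1}^{k} \theta_i^{n'_i}\Big) = 0, \qquad \overline{P}(x') = \overline{E}\Big(\prod_{i=1}^{k} \theta_i^{n'_i}\Big) = \prod_{i=1}^{k} \Big(\frac{n'_i}{n'}\Big)^{n'_i}.$$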
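
As a worked instance of these bounds, the Python sketch below evaluates the upper bound for an assumed future sample with frequencies $(2, 1, 1)$ and compares it with a crude grid search over the simplex. The frequencies, the grid resolution, and the function names are assumptions made for the illustration (Python 3.8 or later is assumed for `math.prod`).

```python
import math
from itertools import product

def f(theta, counts):
    """f(theta) = prod_i theta_i ** n'_i."""
    return math.prod(th ** c for th, c in zip(theta, counts))

def upper_bound(counts):
    """Maximum of f over the simplex, attained at the relative frequencies."""
    n = sum(counts)
    return math.prod((c / n) ** c for c in counts)

counts = (2, 1, 1)          # assumed frequencies of a future sample of size 4
best = upper_bound(counts)  # (1/2)^2 * (1/4) * (1/4)

# Crude grid search over the simplex as a sanity check of the maximizer.
steps = 40
grid_max = max(
    f((a / steps, b / steps, 1 - (a + b) / steps), counts)
    for a, b in product(range(steps + 1), repeat=2)
    if a + b <= steps
)
print(best, grid_max)
print(f((0.0, 0.5, 0.5), counts))  # minimum: one chance with n'_i > 0 set to zero
```

For these frequencies the maximally imprecise probabilities are therefore $0$ and $(1/2)^2 \cdot (1/4) \cdot (1/4) = 1/64$.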

The general results stated in Section 4.1 hold also in the particular case of predictive probabilities. In particular, Corollary 3 can be rewritten as follows.

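Corollary 6 Consider a near-ignorance set of priors $\mathcal{M}_0$. Let $\mathbf{s}$ be given and let $P(\mathbf{s} \mid \vartheta)$ be a continuous strictly positive function on $\Theta$. Then, if $\mathcal{M}_0$ implies prior probabilities for a dataset $\mathbf{x}' \in \mathcal{X}^{n'}$ that are maximally imprecise, the predictive probabilities of $\mathbf{x}'$ are maximally imprecise also a posteriori, after having observed $\mathbf{s}$, i.e.,

$$\underline{P}(\mathbf{x}' \mid \mathbf{s}) = \underline{P}(\mathbf{x}') = 0, \qquad \overline{P}(\mathbf{x}' \mid \mathbf{s}) = \overline{P}(\mathbf{x}') = \prod_{i=1}^{k} \Big(\frac{n'_i}{n'}\Big)^{n'_i}.$$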
4.3 Predicting the next outcome with categorical manifest variables



In this section we consider a special case for which we give necessary and sufficient 
conditions to learn under prior near-ignorance. These conditions are then used to 
analyze the IDM. 

We assume that all the manifest variables in $\mathbf{S}$ are categorical. Given an arbitrary categorical manifest variable $S_i$, denote by $\mathcal{S}^i := \{s_1, \ldots, s_{n_i}\}$ the finite set of possible outcomes of $S_i$. The probabilities of $S_i$ are defined conditional on the realized value of $X_i$ and are given by

$$\lambda_{hj}(S_i) := P(S_i = s_h \mid X_i = x_j),$$

where $h \in \{1, \ldots, n_i\}$ and $j \in \{1, \ldots, k\}$. The probabilities of $S_i$ can be collected in an $n_i \times k$ stochastic matrix $\Lambda(S_i)$ defined by

$$\Lambda(S_i) := \begin{pmatrix} \lambda_{11}(S_i) & \cdots & \lambda_{1k}(S_i) \\ \vdots & \ddots & \vdots \\ \lambda_{n_i 1}(S_i) & \cdots & \lambda_{n_i k}(S_i) \end{pmatrix},$$

which is called the emission matrix of $S_i$.

Our aim, given $\mathbf{s}$, is to predict the next (latent) outcome starting from prior near-ignorance. In other words, our aim is to calculate $\underline{P}(X_{n+1} = x_j \mid \mathbf{s})$ and $\overline{P}(X_{n+1} = x_j \mid \mathbf{s})$ for each $x_j \in \mathcal{X}$, using a set of priors $\mathcal{M}_0$ such that $\underline{P}(X_{n+1} = x_j) = 0$ and $\overline{P}(X_{n+1} = x_j) = 1$ for each $x_j \in \mathcal{X}$.

A possible near-ignorance set of priors for this problem is the set $\mathcal{M}_0$ used in the IDM. We have seen, in Section 3, that this particular near-ignorance set of priors is such that $\underline{P}(X_{n+1} = x_j) = 0$ and $\overline{P}(X_{n+1} = x_j) = 1$ for each $x_j \in \mathcal{X}$. For this particular choice, the following theorem$^{13}$ states necessary and sufficient conditions for learning.

Theorem 7 Let $\Lambda(S_i)$ be the emission matrix of $S_i$ for $i = 1, \ldots, n$. Let $\mathcal{M}_0$ be the near-ignorance set of priors used in the IDM. Given an arbitrary observed dataset $\mathbf{s}$, we obtain a posteriori the following inferences.

1. If all the elements of matrices $\Lambda(S_i)$ are nonzero, then $\overline{P}(X_{n+1} = x_j \mid \mathbf{s}) = 1$, $\underline{P}(X_{n+1} = x_j \mid \mathbf{s}) = 0$, for every $x_j \in \mathcal{X}$.

2. $\overline{P}(X_{n+1} = x_j \mid \mathbf{s}) < 1$ for some $x_j \in \mathcal{X}$, iff we observed at least one manifest variable $S_i = s_h$ such that $\lambda_{hj}(S_i) = 0$.

3. $\underline{P}(X_{n+1} = x_j \mid \mathbf{s}) > 0$ for some $x_j \in \mathcal{X}$, iff we observed at least one manifest variable $S_i = s_h$ such that $\lambda_{hj}(S_i) \neq 0$ and $\lambda_{hr}(S_i) = 0$ for each $r \neq j$ in $\{1, \ldots, k\}$.



13 Theorem 7 is a slightly extended version of Theorem 1 in Piatti et al. [PZT05].
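The conditions of Theorem 7 can be checked mechanically once the emission matrices and the observed manifest outcomes are known. The sketch below is only an illustration under our own conventions: matrices are stored as nested lists indexed as `matrix[h][r]`, outcomes are identified by zero-based indices, and the function name is hypothetical.

```python
def theorem7_conditions(emissions, observed, j):
    """
    emissions[i][h][r] holds lambda_hr(S_i) = P(S_i = s_h | X_i = x_r).
    observed[i] is the row index h of the outcome observed for S_i.
    Returns (upper_below_one, lower_above_zero) for the latent outcome x_j.
    """
    upper_below_one = False
    lower_above_zero = False
    for matrix, h in zip(emissions, observed):
        row = matrix[h]
        if row[j] == 0:
            # Statement 2: the observation is incompatible with x_j.
            upper_below_one = True
        elif all(v == 0 for r, v in enumerate(row) if r != j):
            # Statement 3: the observation is compatible with x_j only.
            lower_above_zero = True
    return upper_below_one, lower_above_zero

# Hypothetical emission matrices (columns sum to one).
noisy = [[0.9, 0.2], [0.1, 0.8]]      # all entries nonzero
identity = [[1.0, 0.0], [0.0, 1.0]]   # perfect observation

print(theorem7_conditions([noisy] * 3, [0, 1, 0], j=0))
print(theorem7_conditions([identity] * 3, [0, 0, 0], j=0))
print(theorem7_conditions([identity] * 3, [0, 0, 0], j=1))
```

With the noisy matrix both flags are false for every dataset, which corresponds to the vacuous inferences of the first statement; with the identity matrix at least one flag becomes true.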
In other words, to avoid vacuous posterior predictive probabilities for the next outcome, we need at least a partial perfection of the observational process. Some simple criteria to recognize settings producing vacuous inferences are the following.

Corollary 8 Under the assumptions of Theorem 7 the following criteria hold:

1. If the $j$-th columns of matrices $\Lambda(S_i)$ have all nonzero elements, then, for each $\mathbf{s}$, $\overline{P}(X_{n+1} = x_j \mid \mathbf{s}) = 1$.

2. If the $j$-th rows of matrices $\Lambda(S_i)$ have more than one nonzero element, then, for each $\mathbf{s}$, $\underline{P}(X_{n+1} = x_j \mid \mathbf{s}) = 0$.

Example 9 Consider again the medical test of Example HI. The manifest variable $S_i$ (the result of the medical test applied to the $i$-th individual) is a binary variable with outcomes positive ($+$) or negative ($-$). The underlying latent variable $X_i$ (the health status of the $i$-th individual) is also a binary variable, with outcomes ill ($I$) or healthy ($H$). The emission matrix in this case is the same for each $i \in \mathbb{N}$ and is a $2 \times 2$ matrix $\Lambda$ whose off-diagonal entries are the error probabilities $\varepsilon_1$ and $\varepsilon_2$ of the test.

All the elements of $\Lambda$ are different from zero. Therefore, using as set of priors the near-ignorance set of priors $\mathcal{M}_0$ of the IDM, according to Theorem 7, we are unable to move away from the initial state of ignorance. This result confirms, in the case of the near-ignorance set of priors of the IDM, the general result of Example HI.

It is interesting to remark that it is impossible to learn for arbitrarily small values of $\varepsilon_1$ and $\varepsilon_2$, provided that they are positive. It follows that there are situations where the observational process cannot be neglected, even when we deem it to be imperfect with tiny probability. This point is particularly interesting when compared to what would be obtained using a model with a single non-informative prior. In this case, the difference between a model with perfect observations and a model that takes into account the probability of error would be very small and therefore the former model would be used instead of the latter. Our results show that this procedure, which is almost an automatism when using models with a single prior, may not be justified in models with sets of priors. The point here seems to be that the amount of imperfection of the observational process should not be evaluated in absolute terms; it should rather be evaluated in comparison with the weakness of the prior beliefs. $\diamond$

The previous example has been concerned with the case in which the IDM is applied to a latent categorical variable. Now we focus on the original setup for which the IDM was conceived, where there are no latent variables. In this case, it is well known that the IDM leads to non-vacuous posterior predictive probabilities for the next outcome. In the next example, we show how such a setup makes the IDM avoid the theoretical limitations stated in Section 4.1.
Example 10 In the IDM, we assume that the IID categorical variables $(X_i)_{i \in \mathbb{N}}$ are observable. In other words, we have $S_i = X_i$ for each $i \geq 1$ and therefore the IDM is not a latent variable model. The IDM is equivalent to a model with categorical manifest variables and emission matrices equal to the identity matrix $I$. Therefore, according to the second and third statements of Theorem 7, if $\mathbf{x}$ contains only observations of the type $x_j$, then

$$\underline{P}(X_{n+1} = x_j \mid \mathbf{x}) > 0, \qquad \overline{P}(X_{n+1} = x_j \mid \mathbf{x}) = 1,$$

$$\underline{P}(X_{n+1} = x_h \mid \mathbf{x}) = 0, \qquad \overline{P}(X_{n+1} = x_h \mid \mathbf{x}) < 1,$$

for each $h \neq j$. Otherwise, for all the other possible observed datasets $\mathbf{x}$,

$$\underline{P}(X_{n+1} = x_j \mid \mathbf{x}) > 0, \qquad \overline{P}(X_{n+1} = x_j \mid \mathbf{x}) < 1,$$

for each $j \in \{1, \ldots, k\}$. It follows that, in general, the IDM produces, for each observed dataset $\mathbf{x}$, non-vacuous posterior predictive probabilities for the next outcome.
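The bounds in this example can be made explicit. Assuming the posterior expectation $E(\vartheta_j \mid \mathbf{x}) = (a_j + s t_j)/(n + s)$ given in Appendix B, and letting $t_j$ range over $(0, 1)$, the lower and upper predictive probabilities are $a_j/(n+s)$ and $(a_j+s)/(n+s)$. The following sketch (with a hypothetical function name and hypothetical counts) evaluates them.

```python
def idm_predictive_bounds(counts, s):
    """
    Lower and upper predictive probabilities of each outcome under the IDM
    with observable variables: (a_j / (n + s), (a_j + s) / (n + s)).
    counts[j] is the number of times x_j was observed; s > 0 is the IDM hyperparameter.
    """
    n = sum(counts)
    return [(a / (n + s), (a + s) / (n + s)) for a in counts]

# Only x_1 observed (three times): lower > 0 and upper = 1 for x_1,
# lower = 0 and upper < 1 for x_2.
print(idm_predictive_bounds([3, 0], s=1.0))

# Both outcomes observed: all bounds lie strictly between 0 and 1.
print(idm_predictive_bounds([2, 1], s=1.0))
```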

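The IDM avoids the theoretical limitations highlighted in Section 4.1 thanks to its particular likelihood function. Having observed $\mathbf{S} = \mathbf{X} = \mathbf{x}$, we have

$$P(\mathbf{S} = \mathbf{x} \mid \vartheta) = P(\mathbf{X} = \mathbf{x} \mid \vartheta) = \prod_{i=1}^{k} \vartheta_i^{n_i},$$

where $n_i$ denotes the number of times that $x_i \in \mathcal{X}$ has been observed in $\mathbf{x}$. We have $P(\mathbf{X} = \mathbf{x} \mid \vartheta) = 0$ for all $\vartheta$ such that $\vartheta_j = 0$ for at least one $j$ such that $n_j > 0$ and $P(\mathbf{X} = \mathbf{x} \mid \vartheta) > 0$ for all the other $\vartheta \in \Theta$, in particular for all $\vartheta$ in the interior of $\Theta$.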

Consider, to make things simpler, that in $\mathbf{x}$ at least two different outcomes have been observed. The posterior predictive probabilities for the next outcome are obtained calculating the lower and upper expectations of the function $f(\vartheta) = \vartheta_j$ for all $j \in \{1, \ldots, k\}$. This function reaches its minimum ($f_{\min} = 0$) if $\vartheta_j = 0$ and its maximum ($f_{\max} = 1$) if $\vartheta_j = 1$. Therefore, the points where the function $f(\vartheta) = \vartheta_j$ reaches its minimum, resp. its maximum, are on the boundary of $\Theta$ and it is easy to show that the likelihood function equals zero at least in one of these points. It follows that the positivity assumptions of Theorem 2 are not met. $\diamond$

Example 10 shows that we are able to learn, using a near-ignorance set of priors, only if the likelihood function $P(\mathbf{s} \mid \vartheta)$ is equal to zero in some critical points. The likelihood function of the IDM is very peculiar, being in general equal to zero on some parts of the boundary of $\Theta$, and therefore allows the use of a near-ignorance set of priors $\mathcal{M}_0$ that models in a satisfactory way a condition of prior (near-)ignorance.$^{14}$

Yet, since the variables $(X_i)_{i \in \mathbb{N}}$ are assumed to be observable, the successful application of a near-ignorance set of priors in the IDM is not helpful in addressing the doubts raised by our theoretical results about the applicability of near-ignorance sets of priors in situations where the variables $(X_i)_{i \in \mathbb{N}}$ are latent, as shown in Example

14 See Walley [Wal96] and Bernard [Ber05] for an in-depth discussion on the properties of the IDM.
5 On modeling observable quantities



In this section, we discuss three alternative approaches that, at first sight, might seem promising to overcome the problem of learning under prior near-ignorance. For the sake of simplicity, we consider the particular problem of calculating predictive probabilities for the next outcome and a very simple setting based on the IDM. The alternative approaches are based on trying to predict the manifest variable rather than the latent one, thus changing perspective with respect to the previous sections. This change of perspective is also useful to consider because on some occasions, e.g., when the imperfection of the observational process is considered to be low, one may deem it sufficient to focus on predicting the manifest variable. We show, however, that the proposed approaches eventually do not solve the mentioned learning question, which therefore remains an open problem.

Let us introduce in detail the simple setting we are going to use. Consider a sequence of independent and identically distributed categorical binary latent variables $(X_i)_{i \in \mathbb{N}}$ with unknown chances $\vartheta = (\vartheta_1, \vartheta_2) = (\vartheta_1, 1 - \vartheta_1)$, and a sequence of IID binary manifest variables $(S_i)_{i \in \mathbb{N}}$ with the same possible outcomes. Since the manifest variables are also IID, they can be regarded as the product of an overall multinomial data-generating process (that includes the generation of the latent variables as well as the observational process) with unknown chances $\xi := (\xi_1, \xi_2) = (\xi_1, 1 - \xi_1)$. Suppose that the emission matrix $\Lambda$ is known, constant for each $i$ and strictly diagonally dominant, i.e.,

$$\Lambda = \begin{pmatrix} 1 - \varepsilon_2 & \varepsilon_1 \\ \varepsilon_2 & 1 - \varepsilon_1 \end{pmatrix},$$

with $\varepsilon_1, \varepsilon_2 \neq 0$, $\varepsilon_1 < 0.5$ and $\varepsilon_2 < 0.5$. This simple matrix models the case in which, for each $i$, we are observing the outcomes of the random variable $X_i$, but there is a positive probability of confounding the actual outcome of $X_i$ with the other one. The random variable $S_i$ represents our observation, while $X_i$ represents the true value. A typical example for this kind of situation is the medical example discussed in Examples H] and 9. Suppose that we have observed $\mathbf{S} = \mathbf{s}$ and our aim is to calculate $\underline{P}(X_{n+1} = x_1 \mid \mathbf{s})$ and $\overline{P}(X_{n+1} = x_1 \mid \mathbf{s})$.

In the previous sections we have dealt with this problem by modeling our ignorance about the chances of $X_{n+1}$ with a near-ignorance set of priors and then calculating $\underline{P}(X_{n+1} = x_1 \mid \mathbf{s})$ and $\overline{P}(X_{n+1} = x_1 \mid \mathbf{s})$. But we already know from Example H] that in this case we obtain vacuous predictive probabilities, i.e.,

$$\underline{P}(X_{n+1} = x_1 \mid \mathbf{s}) = 0, \qquad \overline{P}(X_{n+1} = x_1 \mid \mathbf{s}) = 1.$$

Because this approach does not produce any useful result, one could be tempted to modify it in order to obtain non-vacuous predictive probabilities for the next outcome. We have identified three possible alternative approaches that we discuss below. The basic structure of the three approaches is identical and is based on the idea of focusing on the manifest variables, which are observable, instead of the latent variables. The proposed structure is the following:
• specify a near-ignorance set of priors for the chances $\xi$ of $S_{n+1}$;

• construct predictive probabilities for the manifest variables, i.e.,

$\underline{P}(S_{n+1} = x_1 \mid \mathbf{s})$, $\overline{P}(S_{n+1} = x_1 \mid \mathbf{s})$;

• use the predictive probabilities calculated in the previous point to say something about the predictive probabilities

$\underline{P}(X_{n+1} = x_1 \mid \mathbf{s})$, $\overline{P}(X_{n+1} = x_1 \mid \mathbf{s})$.

The three approaches differ in the specification of the near-ignorance set of priors for $\xi$ and in the way $\underline{P}(S_{n+1} = x_1 \mid \mathbf{s})$ and $\overline{P}(S_{n+1} = x_1 \mid \mathbf{s})$ are used to reconstruct $\underline{P}(X_{n+1} = x_1 \mid \mathbf{s})$ and $\overline{P}(X_{n+1} = x_1 \mid \mathbf{s})$.

The first approach consists in specifying a near-ignorance set of priors for the chances $\xi$ taking into consideration the fact that these chances are related to the chances $\vartheta$ through the equation

$$\xi_1 = (1 - \varepsilon_2) \cdot \vartheta_1 + \varepsilon_1 \cdot (1 - \vartheta_1),$$

and therefore we have $\xi_1 \in [\varepsilon_1, 1 - \varepsilon_2]$. A possible way to specify correctly a near-ignorance set of priors in this case is to consider the near-ignorance set of priors $\mathcal{M}_0$ of the IDM on $\vartheta$, consisting of standard $\mathrm{beta}(s, \mathbf{t})$ distributions, and to substitute

$$\vartheta_1 = \frac{\xi_1 - \varepsilon_1}{1 - (\varepsilon_1 + \varepsilon_2)}, \qquad \vartheta_2 = \frac{\xi_2 - \varepsilon_2}{1 - (\varepsilon_1 + \varepsilon_2)}$$

into all the prior distributions in $\mathcal{M}_0$. We thus obtain a near-ignorance set of priors for $\xi$ consisting of beta distributions scaled on the set $[\varepsilon_1, 1 - \varepsilon_2]$, i.e.,

$$\frac{C}{1 - (\varepsilon_1 + \varepsilon_2)} \left(\frac{\xi_1 - \varepsilon_1}{1 - (\varepsilon_1 + \varepsilon_2)}\right)^{s t_1 - 1} \left(\frac{\xi_2 - \varepsilon_2}{1 - (\varepsilon_1 + \varepsilon_2)}\right)^{s t_2 - 1},$$

where $C := \frac{\Gamma(s)}{\Gamma(s t_1)\,\Gamma(s t_2)}$. But, scaling the distributions, we incur the same problem we have incurred with the IDM for the latent variable. Suppose that we have observed a dataset $\mathbf{s}$ containing $n_1$ times the outcome $x_1$ and $n - n_1$ times the outcome $x_2$. The likelihood function in this case is given by $L(\xi_1, \xi_2) = \xi_1^{n_1} \cdot (1 - \xi_1)^{(n - n_1)}$. Because $\xi_1 \in [\varepsilon_1, 1 - \varepsilon_2]$, the likelihood function is always positive and therefore the extreme distributions that are present in the near-ignorance set of priors for $\xi$ produce vacuous expectations for $\xi_1$, i.e., $\overline{E}(\xi_1 \mid \mathbf{s}) = 1 - \varepsilon_2$ and $\underline{E}(\xi_1 \mid \mathbf{s}) = \varepsilon_1$. It follows that this approach does not solve our theoretical problem. Moreover, it follows that the inability to learn is present under near-ignorance even when we focus on predicting the manifest variable!
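The vacuity of the posterior expectations of $\xi_1$ can be observed numerically. The sketch below is an illustration under stated assumptions: it writes the likelihood as a polynomial in $\vartheta_1$ through $\xi_1 = \varepsilon_1 + (1 - \varepsilon_1 - \varepsilon_2)\vartheta_1$, uses the standard closed form of the moments of a beta distribution, and evaluates the posterior expectation for values of $t_1$ close to 0 and to 1. The function names and the numerical values are hypothetical.

```python
from math import comb

def beta_moment(alpha, beta, m):
    """E[theta^m] for theta ~ Beta(alpha, beta)."""
    out = 1.0
    for j in range(m):
        out *= (alpha + j) / (alpha + beta + j)
    return out

def likelihood_coeffs(n1, n2, eps1, eps2):
    """Coefficients (by degree in theta) of xi^n1 * (1 - xi)^n2, xi = eps1 + c * theta."""
    c = 1.0 - eps1 - eps2
    a = [comb(n1, i) * eps1 ** (n1 - i) * c ** i for i in range(n1 + 1)]
    b = [comb(n2, i) * (1.0 - eps1) ** (n2 - i) * (-c) ** i for i in range(n2 + 1)]
    coeffs = [0.0] * (n1 + n2 + 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            coeffs[i + j] += ai * bj
    return coeffs

def posterior_mean_xi(n1, n2, eps1, eps2, s, t1):
    """Posterior expectation of xi_1 under the scaled beta prior with parameters (s, t1)."""
    alpha, beta = s * t1, s * (1.0 - t1)
    coeffs = likelihood_coeffs(n1, n2, eps1, eps2)
    num = sum(cm * beta_moment(alpha, beta, m + 1) for m, cm in enumerate(coeffs))
    den = sum(cm * beta_moment(alpha, beta, m) for m, cm in enumerate(coeffs))
    return eps1 + (1.0 - eps1 - eps2) * num / den

# Hypothetical data: x_1 observed twice, x_2 once; error probabilities 0.1.
low = posterior_mean_xi(2, 1, 0.1, 0.1, s=2.0, t1=1e-9)
mid = posterior_mean_xi(2, 1, 0.1, 0.1, s=2.0, t1=0.5)
high = posterior_mean_xi(2, 1, 0.1, 0.1, s=2.0, t1=1.0 - 1e-9)
print(low, mid, high)
```

As $t_1$ approaches 0 or 1, the posterior expectation approaches $\varepsilon_1$ or $1 - \varepsilon_2$ respectively, whatever the observed counts, in agreement with the argument above.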

The second, more naive, approach consists in using the near-ignorance set of priors $\mathcal{M}_0$ used in the standard IDM to model ignorance about $\xi$. In this way we are assuming (wrongly) that $\xi_1 \in [0, 1]$, thus ignoring the fact that $\xi_1 \in [\varepsilon_1, 1 - \varepsilon_2]$ and therefore implicitly ignoring the emission matrix $\Lambda$. Applying the standard IDM on $\xi$ we are able to produce non-vacuous probabilities $\underline{P}(S_{n+1} = x_1 \mid \mathbf{s})$ and $\overline{P}(S_{n+1} = x_1 \mid \mathbf{s})$. Now, because $\Lambda$ is known, knowing the value of $P(S_{n+1} = x_1 \mid \mathbf{s})$ it is possible to reconstruct $P(X_{n+1} = x_1 \mid \mathbf{s})$. But this approach, which on the one hand ignores $\Lambda$ and on the other hand takes it into consideration, is clearly wrong. For example, it can be easily shown that it can produce probabilities outside $[0, 1]$.
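A minimal sketch of how values outside $[0, 1]$ can arise is the following. It assumes that the reconstruction step inverts the relation $\xi_1 = \varepsilon_1 + (1 - \varepsilon_1 - \varepsilon_2)\vartheta_1$ and that the IDM bounds for the manifest variable are $n_1/(n+s)$ and $(n_1+s)/(n+s)$; the function name and the numbers are hypothetical.

```python
def naive_latent_bounds(n1, n, s, eps1, eps2):
    """
    Apply the standard IDM to the manifest counts, then map the resulting
    bounds for xi_1 back to theta_1 through the known emission matrix.
    """
    lower_xi = n1 / (n + s)          # IDM lower probability of S_{n+1} = x_1
    upper_xi = (n1 + s) / (n + s)    # IDM upper probability of S_{n+1} = x_1
    scale = 1.0 - (eps1 + eps2)
    return (lower_xi - eps1) / scale, (upper_xi - eps1) / scale

# x_1 never observed in 10 trials: the reconstructed lower bound is negative.
print(naive_latent_bounds(0, 10, s=2.0, eps1=0.1, eps2=0.1))

# x_1 observed in all 10 trials: the reconstructed upper bound exceeds one.
print(naive_latent_bounds(10, 10, s=2.0, eps1=0.1, eps2=0.1))
```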

Finally, a third possible approach could be to neglect the existence of the latent level and consider $S_{n+1}$ to be the variable of interest. Applying the standard IDM on the manifest variables we are clearly able to produce non-vacuous probabilities $\underline{P}(S_{n+1} = x_1 \mid \mathbf{s})$ and $\overline{P}(S_{n+1} = x_1 \mid \mathbf{s})$ that are then simply used instead of the probabilities $\underline{P}(X_{n+1} = x_1 \mid \mathbf{s})$ and $\overline{P}(X_{n+1} = x_1 \mid \mathbf{s})$ in the problem of interest. This approach is the one typically followed by those who apply the IDM in practical problems.$^{15}$ This approach requires the user to assume perfect observability, an assumption that appears to be incorrect in most (if not all) real statistical problems. And yet this procedure, despite being wrong or hardly justifiable from a theoretical point of view, has produced in several applications of the IDM useful results, at least from an empirical point of view. This paradox between our theoretical results and the current practice is an open problem that deserves to be investigated in further research.

6 Conclusions 

In this paper we have proved a sufficient condition that prevents learning about a latent categorical variable from taking place under prior near-ignorance regarding the data-generating process.

The condition holds as soon as the likelihood is strictly positive (and continuous), and so is satisfied frequently, even in the most common and simple settings. Taking into account that the considered framework is very general and pervasive of statistical practice, we regard this result as a form of strong evidence against the possibility of using prior near-ignorance in real statistical problems. Given also that prior near-ignorance is arguably a privileged way to model a state of ignorance, our results appear to substantially reduce the hope of being able to adopt a form of prior ignorance to do objective-minded statistical inference.

With respect to future research, two possible research directions seem to be 
particularly important to investigate. 

As reported by Bernard [Ber05], near-ignorance sets of priors, in the specific form of the IDM, have been successfully used in a number of applications. On the other hand, the theoretical results presented in this paper point to the impossibility of learning in real statistical problems when starting from a state of near-ignorance. This paradox between empirical and theoretical results should be investigated in
15 See Bernard [Ber05] for a list of applications of the IDM.
order to better understand the practical relevance of the theoretical analysis presented here, and more generally to explain the mechanism behind such an apparent
contradiction. 

The proofs contained in this paper suggest that the impossibility of learning 
under prior near-ignorance with latent variables is mainly due to the presence, in 
the set of priors, of extreme distributions arbitrarily close to the deterministic ones. 
Some preliminary experimental analyses have shown that learning is possible as 
soon as one restricts the set of priors so as to rule out the extreme distributions. 
This can be realized by defining a notion of distance between priors and then by 
allowing a distribution to enter the prior set of probability distributions only if it is 
at least a certain positive distance away from the deterministic priors. The minimal 
distance can be chosen arbitrarily small (while remaining positive), and this allows 
one to model a state of very weak beliefs, close to near-ignorance. Such a weak 
state of beliefs could keep some of the advantages of near-ignorance (although it 
would clearly not be a model of ignorance) while permitting learning to take place. 
The main problem of this approach is the justification, i.e., the interpretation of the 
(arbitrary) restriction of the near-ignorance set of priors. A way to address this issue 
might be to identify a set of desirable principles, possibly similar to the symmetry 
and embedding principles, leading in a natural way to a suitably large set of priors 
describing a state close to near-ignorance. 
A Technical preliminaries

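In this appendix we prove some technical results that are used to prove the theorems in the paper. First of all, we introduce some notation used in this appendix. Consider a sequence of probability densities $(p_n)_{n \in \mathbb{N}}$ and a function $f$ defined on a set $\Theta$. Then we use the notation

$$E_n(f) := \int_\Theta f(\vartheta)\, p_n(\vartheta)\, d\vartheta, \qquad P_n(\tilde{\Theta}) := \int_{\tilde{\Theta}} p_n(\vartheta)\, d\vartheta, \quad \tilde{\Theta} \subseteq \Theta,$$

and with $\to$ we denote $\lim_{n \to \infty}$.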

Theorem 11 Let $\Theta \subset \mathbb{R}^k$ be the closed $k$-dimensional simplex and let $(p_n)_{n \in \mathbb{N}}$ be a sequence of probability densities defined on $\Theta$ w.r.t. the Lebesgue measure. Let $f \geq 0$ be a bounded continuous function on $\Theta$ and let $f_{\max} := \sup_\Theta(f)$ and $f_{\min} := \inf_\Theta(f)$. For this function define the measurable sets

$$\Theta_\delta = \{\vartheta \in \Theta \mid f(\vartheta) \geq f_{\max} - \delta\}, \qquad (5)$$

$$\tilde{\Theta}_\delta = \{\vartheta \in \Theta \mid f(\vartheta) \leq f_{\min} + \delta\}. \qquad (6)$$

1. Assume that $(p_n)_{n \in \mathbb{N}}$ concentrates on a maximum of $f$ for $n \to \infty$, in the sense that

$$E_n(f) \to f_{\max}, \qquad (7)$$

then, for all $\delta > 0$, it holds

$$P_n(\Theta_\delta) \to 1.$$

2. Assume that $(p_n)_{n \in \mathbb{N}}$ concentrates on a minimum of $f$ for $n \to \infty$, in the sense that

$$E_n(f) \to f_{\min}, \qquad (8)$$

then, for all $\delta > 0$, it holds

$$P_n(\tilde{\Theta}_\delta) \to 1.$$

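Proof. We begin by proving the first statement. Let $\delta > 0$ be arbitrary and $\Theta_\delta^c := \Theta \setminus \Theta_\delta$. From (5) we know that on $\Theta_\delta$ it holds $f(\vartheta) \geq f_{\max} - \delta$, and therefore on $\Theta_\delta^c$ we have $f(\vartheta) \leq f_{\max} - \delta$, and thus

$$\frac{f_{\max} - f(\vartheta)}{\delta} \geq 1. \qquad (9)$$

It follows that

$$1 - P_n(\Theta_\delta) = P_n(\Theta_\delta^c) = \int_{\Theta_\delta^c} p_n(\vartheta)\, d\vartheta \leq \int_{\Theta_\delta^c} \frac{f_{\max} - f(\vartheta)}{\delta}\, p_n(\vartheta)\, d\vartheta \leq \int_{\Theta} \frac{f_{\max} - f(\vartheta)}{\delta}\, p_n(\vartheta)\, d\vartheta = \frac{1}{\delta}\big(f_{\max} - E_n(f)\big) \to 0,$$

and therefore $P_n(\Theta_\delta) \to 1$ and thus the first statement is proved. To prove the second statement, let $\delta > 0$ be arbitrary and $\tilde{\Theta}_\delta^c := \Theta \setminus \tilde{\Theta}_\delta$. From (6) we know that on $\tilde{\Theta}_\delta$ it holds $f(\vartheta) \leq f_{\min} + \delta$, and therefore on $\tilde{\Theta}_\delta^c$ we have $f(\vartheta) \geq f_{\min} + \delta$, and thus

$$\frac{f(\vartheta) - f_{\min}}{\delta} \geq 1. \qquad (10)$$

It follows that

$$1 - P_n(\tilde{\Theta}_\delta) = P_n(\tilde{\Theta}_\delta^c) = \int_{\tilde{\Theta}_\delta^c} p_n(\vartheta)\, d\vartheta \leq \int_{\tilde{\Theta}_\delta^c} \frac{f(\vartheta) - f_{\min}}{\delta}\, p_n(\vartheta)\, d\vartheta \leq \int_{\Theta} \frac{f(\vartheta) - f_{\min}}{\delta}\, p_n(\vartheta)\, d\vartheta = \frac{1}{\delta}\big(E_n(f) - f_{\min}\big) \to 0,$$

and therefore $P_n(\tilde{\Theta}_\delta) \to 1$.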
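The Markov-type bound used in the proof can be checked on a simple case. As an illustration only, we take $k = 2$, identify the simplex with the interval $[0, 1]$, and choose the hypothetical sequence of densities $p_n(\vartheta) = n \vartheta^{n-1}$ with $f(\vartheta) = \vartheta$, so that $f_{\max} = 1$, $E_n(f) = n/(n+1)$ and $1 - P_n(\Theta_\delta) = (1 - \delta)^n$.

```python
def tail_mass(n, delta):
    """1 - P_n(Theta_delta) for p_n(theta) = n * theta**(n - 1) and f(theta) = theta."""
    return (1.0 - delta) ** n

def markov_bound(n, delta):
    """(f_max - E_n(f)) / delta with f_max = 1 and E_n(f) = n / (n + 1)."""
    return (1.0 - n / (n + 1.0)) / delta

delta = 0.1
for n in (1, 10, 100, 1000):
    # The mass outside Theta_delta never exceeds the bound, and both vanish as n grows.
    print(n, tail_mass(n, delta), markov_bound(n, delta))
```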



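Theorem 12 Let $L(\vartheta) \geq 0$ be a bounded measurable function and suppose that the assumptions of Theorem 11 hold. Then the following two statements hold.

1. If the function $L(\vartheta)$ is such that

$$c := \lim_{\delta \to 0} \inf_{\vartheta \in \Theta_\delta} L(\vartheta) > 0, \qquad (11)$$

and $(p_n)_{n \in \mathbb{N}}$ concentrates on a maximum of $f$ for $n \to \infty$, then

$$\frac{E_n(Lf)}{E_n(L)} = \frac{\int_\Theta f(\vartheta) L(\vartheta)\, p_n(\vartheta)\, d\vartheta}{\int_\Theta L(\vartheta)\, p_n(\vartheta)\, d\vartheta} \to f_{\max}. \qquad (12)$$

2. If the function $L(\vartheta)$ is such that

$$c := \lim_{\delta \to 0} \inf_{\vartheta \in \tilde{\Theta}_\delta} L(\vartheta) > 0, \qquad (13)$$

and $(p_n)_{n \in \mathbb{N}}$ concentrates on a minimum of $f$ for $n \to \infty$, then

$$\frac{E_n(Lf)}{E_n(L)} \to f_{\min}. \qquad (14)$$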



Remark 13 If $L$ is strictly positive in each point in $\Theta$ where the function $f$ reaches its maximum, resp. minimum, and is continuous in an arbitrarily small neighborhood of those points, then (11), resp. (13), are satisfied.
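The role of the positivity condition can be illustrated in closed form. The sketch below assumes $k = 2$ with the simplex identified with $[0, 1]$, $f(\vartheta) = \vartheta$, a linear function $L(\vartheta) = a + b\vartheta$, and beta densities with parameters $(s t_1, s(1 - t_1))$, whose first two moments are known in closed form. All names and numerical values are hypothetical.

```python
def beta_moments(s, t1):
    """First and second moments of theta ~ Beta(s * t1, s * (1 - t1))."""
    m1 = t1
    m2 = t1 * (s * t1 + 1.0) / (s + 1.0)
    return m1, m2

def ratio(a, b, s, t1):
    """E(L f) / E(L) for f(theta) = theta and L(theta) = a + b * theta."""
    m1, m2 = beta_moments(s, t1)
    return (a * m1 + b * m2) / (a + b * m1)

s = 2.0
t1 = 1.0 - 1e-6   # prior concentrating on the maximum of f at theta = 1

# L strictly positive on [0, 1]: condition (11) holds and the ratio tends to f_max = 1.
print(ratio(0.1, 0.8, s, t1))

# L(theta) = 1 - theta vanishes at the maximum of f: condition (11) fails
# and the ratio stays close to s / (s + 1) instead of 1.
print(ratio(1.0, -1.0, s, t1))
```

The second case corresponds to a likelihood that is zero at the point where $f$ is maximal, which is the situation in which the limit in (12) need not hold.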

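Proof. We begin by proving the first statement of the theorem. Fix $\varepsilon$ and $\delta$ arbitrarily small, but $\delta$ small enough such that $\inf_{\vartheta \in \Theta_\delta} L(\vartheta) > \frac{c}{2}$. Denote by $L_{\max}$ the supremum of the function $L(\vartheta)$ in $\Theta$, and let $\Theta_\delta^c := \Theta \setminus \Theta_\delta$. From Theorem 11 we know that $P_n(\Theta_\delta) \geq 1 - \varepsilon$, for $n$ sufficiently large. This implies, for $n$ sufficiently large,

$$E_n(L) = \int_\Theta L(\vartheta)\, p_n(\vartheta)\, d\vartheta \geq \int_{\Theta_\delta} L(\vartheta)\, p_n(\vartheta)\, d\vartheta \geq \frac{c}{2}(1 - \varepsilon), \qquad (15)$$

$$E_n(Lf) \leq E_n(L f_{\max}) = f_{\max}\, E_n(L), \qquad (16)$$

$$E_n(L) = \int_{\Theta_\delta^c} L(\vartheta)\, p_n(\vartheta)\, d\vartheta + \int_{\Theta_\delta} L(\vartheta)\, p_n(\vartheta)\, d\vartheta \leq L_{\max} \int_{\Theta_\delta^c} p_n(\vartheta)\, d\vartheta + \int_{\Theta_\delta} \underbrace{\frac{f(\vartheta)}{f_{\max} - \delta}}_{\geq 1 \text{ on } \Theta_\delta} L(\vartheta)\, p_n(\vartheta)\, d\vartheta \leq L_{\max} \cdot \varepsilon + \frac{1}{f_{\max} - \delta}\, E_n(Lf). \qquad (17)$$

Combining (15), (16) and (17), we have

$$f_{\max} \geq \frac{E_n(Lf)}{E_n(L)} \geq (f_{\max} - \delta)\, \frac{E_n(L) - L_{\max} \cdot \varepsilon}{E_n(L)} \geq (f_{\max} - \delta) \left(1 - \frac{L_{\max} \cdot \varepsilon}{\frac{c}{2}(1 - \varepsilon)}\right).$$

Since the right-hand side of the last inequality tends to $f_{\max}$ for $\delta, \varepsilon \to 0$, and both $\delta, \varepsilon$ can be chosen arbitrarily small, we have

$$\frac{E_n(Lf)}{E_n(L)} \to f_{\max}.$$

To prove the second statement of the theorem, fix $\varepsilon$ and $\delta$ arbitrarily small, but $\delta$ small enough such that $\inf_{\vartheta \in \tilde{\Theta}_\delta} L(\vartheta) > \frac{c}{2}$, and let $\tilde{\Theta}_\delta^c := \Theta \setminus \tilde{\Theta}_\delta$. From Theorem 11, we know that $P_n(\tilde{\Theta}_\delta) \geq 1 - \varepsilon$, for $n$ sufficiently large and therefore $P_n(\tilde{\Theta}_\delta^c) \leq \varepsilon$. This implies, for $n$ sufficiently large,

$$E_n(L) = \int_\Theta L(\vartheta)\, p_n(\vartheta)\, d\vartheta \geq \int_{\tilde{\Theta}_\delta} L(\vartheta)\, p_n(\vartheta)\, d\vartheta \geq \frac{c}{2}(1 - \varepsilon), \qquad (18)$$

$$E_n(Lf) \geq E_n(L f_{\min}) = f_{\min}\, E_n(L) \;\Longrightarrow\; f_{\min} \leq \frac{E_n(Lf)}{E_n(L)}. \qquad (19)$$

Define the function

$$K(\vartheta) := \left(1 - \frac{f(\vartheta)}{f_{\min} + \delta}\right) L(\vartheta).$$

By definition, the function $K$ is negative on $\tilde{\Theta}_\delta^c$ and is bounded; denote by $K_{\min}$ the (negative) minimum of $K$. We have

$$\begin{aligned}
E_n(L) &= \int_{\tilde{\Theta}_\delta^c} L(\vartheta)\, p_n(\vartheta)\, d\vartheta + \int_{\tilde{\Theta}_\delta} L(\vartheta)\, p_n(\vartheta)\, d\vartheta \\
&\geq \int_{\tilde{\Theta}_\delta^c} L(\vartheta)\, p_n(\vartheta)\, d\vartheta + \int_{\tilde{\Theta}_\delta} \underbrace{\frac{f(\vartheta)}{f_{\min} + \delta}}_{\leq 1 \text{ on } \tilde{\Theta}_\delta} L(\vartheta)\, p_n(\vartheta)\, d\vartheta \\
&= \int_{\tilde{\Theta}_\delta^c} \underbrace{\left(1 - \frac{f(\vartheta)}{f_{\min} + \delta}\right) L(\vartheta)}_{= K(\vartheta)}\, p_n(\vartheta)\, d\vartheta + \frac{1}{f_{\min} + \delta} \underbrace{\int_{\Theta} f(\vartheta) L(\vartheta)\, p_n(\vartheta)\, d\vartheta}_{= E_n(Lf)} \\
&\geq K_{\min} \cdot P_n(\tilde{\Theta}_\delta^c) + \frac{1}{f_{\min} + \delta}\, E_n(Lf).
\end{aligned}$$

It follows that

$$\big(E_n(L) - K_{\min} \cdot P_n(\tilde{\Theta}_\delta^c)\big)(f_{\min} + \delta) \geq E_n(Lf),$$

and thus, combining the last inequality with (18) and (19), we obtain

$$f_{\min} \leq \frac{E_n(Lf)}{E_n(L)} \leq (f_{\min} + \delta) \left(1 + \frac{|K_{\min}| \cdot P_n(\tilde{\Theta}_\delta^c)}{E_n(L)}\right) \leq (f_{\min} + \delta) \left(1 + \frac{|K_{\min}| \cdot \varepsilon}{\frac{c}{2}(1 - \varepsilon)}\right).$$

Since the right-hand side of the last inequality tends to $f_{\min}$ for $\delta, \varepsilon \to 0$, and both $\delta, \varepsilon$ can be chosen arbitrarily small, we have

$$\frac{E_n(Lf)}{E_n(L)} \to f_{\min}.$$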



B Proofs of the main results 

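Proof of Theorem 2. Define $f_{\min} := \inf_{\vartheta \in \Theta} f(\vartheta)$, $f_{\max} := \sup_{\vartheta \in \Theta} f(\vartheta)$, and define the bounded non-negative function $\tilde{f}(\vartheta) := f(\vartheta) - f_{\min} \geq 0$. We have $\tilde{f}_{\max} = f_{\max} - f_{\min}$. If $\mathcal{M}_0$ is such that, a priori, $\overline{E}(f) = f_{\max}$, then we have also that $\overline{E}(\tilde{f}) = \tilde{f}_{\max}$, because

$$\overline{E}(\tilde{f}) = \sup_{p \in \mathcal{M}_0} E_p(f - f_{\min}) = \sup_{p \in \mathcal{M}_0} E_p(f) - f_{\min} = \overline{E}(f) - f_{\min} = f_{\max} - f_{\min} = \tilde{f}_{\max}.$$

Then, it is possible to define a sequence $(p_n)_{n \in \mathbb{N}} \subset \mathcal{M}_0$ such that $E_n(\tilde{f}) \to \tilde{f}_{\max}$. According to Theorem 12, substituting $L(\vartheta)$ with $P(\mathbf{s} \mid \vartheta)$ in (12), we see that $E_n(\tilde{f} \mid \mathbf{s}) \to \tilde{f}_{\max} = \overline{E}(\tilde{f})$ and therefore $\overline{E}(\tilde{f} \mid \mathbf{s}) = \overline{E}(\tilde{f})$, from which it follows that

$$\overline{E}(f \mid \mathbf{s}) - f_{\min} = \overline{E}(f) - f_{\min} = f_{\max} - f_{\min}.$$

We can conclude that $\overline{E}(f \mid \mathbf{s}) = \overline{E}(f) = f_{\max}$. In the same way, substituting $\underline{E}$ for $\overline{E}$, we can prove that $\underline{E}(f \mid \mathbf{s}) = \underline{E}(f) = f_{\min}$. ■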

Corollary 3 is a direct consequence of Theorem 2.

Proof of Theorem 7. To prove Theorem 7 we need the following lemma.

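Lemma 14 Consider a dataset $\mathbf{x}$ with frequencies $\mathbf{a}^{\mathbf{x}} = (a_1^{\mathbf{x}}, \ldots, a_k^{\mathbf{x}})$. Then, the following equality holds,

$$\prod_{h=1}^{k} \vartheta_h^{a_h^{\mathbf{x}}} \cdot dir_{s,\mathbf{t}}(\vartheta) = \frac{\prod_{h=1}^{k} \prod_{j=1}^{a_h^{\mathbf{x}}} (s t_h + j - 1)}{\prod_{j=1}^{n} (s + j - 1)} \cdot dir_{s^{\mathbf{x}},\mathbf{t}^{\mathbf{x}}}(\vartheta),$$

where $s^{\mathbf{x}} := n + s$ and $t_h^{\mathbf{x}} := \frac{a_h^{\mathbf{x}} + s t_h}{n + s}$. When $a_h^{\mathbf{x}} = 0$, we set $\prod_{j=1}^{0} (s t_h + j - 1) := 1$ by definition.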
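A proof of Lemma 14 is in [PZT05]. Because $P(\mathbf{x} \mid \vartheta) = \prod_{h=1}^{k} \vartheta_h^{a_h^{\mathbf{x}}}$, according to Bayes' rule, we have $p(\vartheta \mid \mathbf{x}) = dir_{s^{\mathbf{x}},\mathbf{t}^{\mathbf{x}}}(\vartheta)$ and

$$P(\mathbf{x}) = \frac{\prod_{h=1}^{k} \prod_{j=1}^{a_h^{\mathbf{x}}} (s t_h + j - 1)}{\prod_{j=1}^{n} (s + j - 1)}. \qquad (20)$$

Given a Dirichlet distribution $dir_{s,\mathbf{t}}(\vartheta)$, the expected value $E(\vartheta_j)$ is given by $E(\vartheta_j) = t_j$ (see [KBJ00]). It follows that

$$E(\vartheta_j \mid \mathbf{x}) = t_j^{\mathbf{x}} = \frac{a_j^{\mathbf{x}} + s t_j}{n + s}.$$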
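The two formulas above can be cross-checked numerically. The sketch below assumes that the probability of a dataset under a Dirichlet prior has the standard product form $\prod_h \prod_{j=1}^{a_h}(s t_h + j - 1) / \prod_{j=1}^{n}(s + j - 1)$; it verifies that these probabilities sum to one over all datasets of a given length and that the ratio of the probabilities of an extended dataset and of the original one equals the posterior expectation. Function names and parameter values are hypothetical.

```python
from itertools import product

def dataset_probability(counts, s, t):
    """P(x) for a dataset with frequencies `counts` under a Dirichlet(s, t) prior."""
    n = sum(counts)
    numerator = 1.0
    for a_h, t_h in zip(counts, t):
        for j in range(1, a_h + 1):
            numerator *= s * t_h + j - 1
    denominator = 1.0
    for j in range(1, n + 1):
        denominator *= s + j - 1
    return numerator / denominator

def posterior_mean(counts, s, t, j):
    """E(theta_j | x) = (a_j + s * t_j) / (n + s)."""
    return (counts[j] + s * t[j]) / (sum(counts) + s)

def counts_of(sequence, k):
    """Frequencies of the outcomes 0, ..., k-1 in a sequence."""
    return [sequence.count(h) for h in range(k)]

s, t, k, n = 2.0, (0.2, 0.3, 0.5), 3, 3

# The probabilities of all k**n ordered datasets of length n sum to one.
total = sum(dataset_probability(counts_of(x, k), s, t)
            for x in product(range(k), repeat=n))
print(total)

# Predictive probability of outcome 2 after observing counts (2, 1, 0).
base = dataset_probability([2, 1, 0], s, t)
extended = dataset_probability([2, 1, 1], s, t)
print(extended / base, posterior_mean([2, 1, 0], s, t, 2))
```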

We are now ready to prove Theorem 7.

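1. The first statement of Theorem 7 is a consequence of Corollary 6. Because $S_i$ is independent of $\vartheta$ given $X_i$ for each $i \in \mathbb{N}$, we have

$$P(\mathbf{s} \mid \mathbf{x}, \vartheta) = P(\mathbf{s} \mid \mathbf{x}), \qquad (21)$$

and therefore, using (21) and Bayes' rule, we obtain the likelihood function,

$$L(\vartheta) = P(\mathbf{s} \mid \vartheta) = \sum_{\mathbf{x} \in \mathcal{X}^n} P(\mathbf{s} \mid \mathbf{x}) \cdot P(\mathbf{x} \mid \vartheta) = \sum_{\mathbf{x} \in \mathcal{X}^n} P(\mathbf{s} \mid \mathbf{x}) \cdot \prod_{j=1}^{k} \vartheta_j^{a_j^{\mathbf{x}}}. \qquad (22)$$

Because all the elements of the matrices $\Lambda(S_i)$ are nonzero, we have $P(\mathbf{s} \mid \mathbf{x}) > 0$, for each $\mathbf{s}$ and each $\mathbf{x} \in \mathcal{X}^n$. For each $\vartheta \in \Theta$, there is at least one $\mathbf{x} \in \mathcal{X}^n$ such that $\prod_{j=1}^{k} \vartheta_j^{a_j^{\mathbf{x}}} > 0$. It follows that

$$L(\vartheta) = \sum_{\mathbf{x} \in \mathcal{X}^n} P(\mathbf{s} \mid \mathbf{x}) \cdot \prod_{j=1}^{k} \vartheta_j^{a_j^{\mathbf{x}}} > 0,$$

for each $\vartheta \in \Theta$ and therefore, according to Corollary 6 with $n' = 1$, the predictive probabilities that are vacuous a priori remain vacuous also a posteriori.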

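2. We have $P(X_{n+1} = x_j \mid \mathbf{s}) = E(\vartheta_j \mid \mathbf{s})$, and therefore, according to Lemma 14 and Bayes' rule,

$$\begin{aligned}
P(X_{n+1} = x_j \mid \mathbf{s}) &= \frac{\int_\Theta \vartheta_j\, P(\mathbf{s} \mid \vartheta)\, p(\vartheta)\, d\vartheta}{\int_\Theta P(\mathbf{s} \mid \vartheta)\, p(\vartheta)\, d\vartheta} \\
&= \frac{\sum_{\mathbf{x} \in \mathcal{X}^n} P(\mathbf{s} \mid \mathbf{x}) \int_\Theta \vartheta_j\, P(\mathbf{x} \mid \vartheta)\, p(\vartheta)\, d\vartheta}{\sum_{\mathbf{x} \in \mathcal{X}^n} P(\mathbf{s} \mid \mathbf{x}) \int_\Theta P(\mathbf{x} \mid \vartheta)\, p(\vartheta)\, d\vartheta} \\
&= \frac{\sum_{\mathbf{x} \in \mathcal{X}^n} P(\mathbf{s} \mid \mathbf{x})\, P(\mathbf{x}) \int_\Theta \vartheta_j\, p(\vartheta \mid \mathbf{x})\, d\vartheta}{\sum_{\mathbf{x} \in \mathcal{X}^n} P(\mathbf{s} \mid \mathbf{x})\, P(\mathbf{x})} \\
&= \sum_{\mathbf{x} \in \mathcal{X}^n} \left(\frac{P(\mathbf{s} \mid \mathbf{x})\, P(\mathbf{x})}{\sum_{\mathbf{x}' \in \mathcal{X}^n} P(\mathbf{s} \mid \mathbf{x}')\, P(\mathbf{x}')}\right) \cdot \frac{a_j^{\mathbf{x}} + s t_j}{n + s}, \qquad (23)
\end{aligned}$$

which is a convex sum of fractions and is therefore a continuous function of $\mathbf{t}$ on $\mathcal{T}$. It can be checked that the denominator of (23) is positive and therefore conditioning on events with zero probability is not a problem in this setting. Denote by $\mathbf{x}^j$ the dataset of length $n$ composed only of outcomes $x_j$, i.e., the dataset with $a_j^{\mathbf{x}^j} = n$ and $a_h^{\mathbf{x}^j} = 0$ for each $h \neq j$. For all $\mathbf{x} \neq \mathbf{x}^j$ we have

$$\frac{a_j^{\mathbf{x}} + s t_j}{n + s} \leq \frac{n - 1 + s t_j}{n + s} \leq \frac{n - 1 + s}{n + s} < 1,$$

on $\overline{\mathcal{T}}$ (the closure of $\mathcal{T}$); only $\mathbf{x}^j$ has

$$\sup_{\mathbf{t} \in \mathcal{T}} \frac{a_j^{\mathbf{x}^j} + s t_j}{n + s} = \sup_{\mathbf{t} \in \mathcal{T}} \frac{n + s t_j}{n + s} = 1.$$

A convex sum of fractions smaller than or equal to one is equal to one only if the weights associated to fractions smaller than one are all equal to zero and there are some positive weights associated to fractions equal to one. If $P(\mathbf{s} \mid \mathbf{x}^j) = 0$, then (23) is a convex combination of fractions strictly smaller than 1 on $\overline{\mathcal{T}}$ and therefore $\overline{P}(X_{n+1} = x_j \mid \mathbf{s}) < 1$. If $P(\mathbf{s} \mid \mathbf{x}^j) \neq 0$, then letting $t_j \to 1$, and consequently $t_h \to 0$ for all $h \neq j$, according to (20), we have $P(\mathbf{x}^j) \to 1$ and $P(\mathbf{x}) \to 0$ for all $\mathbf{x} \neq \mathbf{x}^j$, and thus, using (23),

$$1 \geq \overline{P}(X_{n+1} = x_j \mid \mathbf{s}) \geq \lim_{t_j \to 1} P(X_{n+1} = x_j \mid \mathbf{s}) = \frac{P(\mathbf{s} \mid \mathbf{x}^j) \cdot 1}{P(\mathbf{s} \mid \mathbf{x}^j) \cdot 1} \cdot \frac{n + s}{n + s} = 1.$$

If we have observed a manifest variable $S_i = s_h$ with $\lambda_{hj}(S_i) = 0$, it means that the observation excludes the possibility that the underlying value of $X_i$ is $x_j$, therefore $P(\mathbf{s} \mid \mathbf{x}^j) = 0$ and thus

$$\overline{P}(X_{n+1} = x_j \mid \mathbf{s}) < 1.$$

On the other hand, if $\overline{P}(X_{n+1} = x_j \mid \mathbf{s}) < 1$, it must hold that $P(\mathbf{s} \mid \mathbf{x}^j) = 0$, i.e., that we have observed a realization of a manifest variable that is incompatible with the underlying (latent) outcome $x_j$. But a realization of a manifest variable is incompatible with the underlying (latent) outcome $x_j$ only if the observed manifest variable was $S_i = s_h$ with $\lambda_{hj}(S_i) = 0$.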

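3. Having observed a manifest variable $S_i = s_h$ such that $\lambda_{hj}(S_i) \neq 0$ and $\lambda_{hr}(S_i) = 0$ for each $r \neq j$ in $\{1, \ldots, k\}$, we are sure that the underlying value of $X_i$ is $x_j$. Therefore, $P(\mathbf{s} \mid \mathbf{x}) = 0$ for all $\mathbf{x}$ with $a_j^{\mathbf{x}} = 0$. It follows from (23) that

$$P(X_{n+1} = x_j \mid \mathbf{s}) = \frac{\sum_{\mathbf{x} \in \mathcal{X}^n,\, a_j^{\mathbf{x}} > 0} P(\mathbf{s} \mid \mathbf{x})\, P(\mathbf{x}) \cdot \frac{a_j^{\mathbf{x}} + s t_j}{n + s}}{\sum_{\mathbf{x} \in \mathcal{X}^n,\, a_j^{\mathbf{x}} > 0} P(\mathbf{s} \mid \mathbf{x})\, P(\mathbf{x})},$$

which is a convex combination of terms

$$\frac{a_j^{\mathbf{x}} + s t_j}{n + s} \geq \frac{a_j^{\mathbf{x}}}{n + s} \geq \frac{1}{n + s},$$

and is therefore greater than zero for each $\mathbf{t} \in \mathcal{T}$. It follows that

$$\underline{P}(X_{n+1} = x_j \mid \mathbf{s}) \geq \frac{1}{n + s} > 0.$$

On the other hand, if we do not observe a manifest variable as described above, there surely exists at least one $\mathbf{x}$ with $a_j^{\mathbf{x}} = 0$ and $P(\mathbf{s} \mid \mathbf{x}) > 0$. In this case, using (23) and letting $t_j \to 0$, we have, because of (20), that $P(\mathbf{x}) \to 0$ for all $\mathbf{x}$ with $a_j^{\mathbf{x}} > 0$. It follows that

$$\lim_{t_j \to 0} P(X_{n+1} = x_j \mid \mathbf{s}) = \lim_{t_j \to 0} \frac{\sum_{\mathbf{x} \in \mathcal{X}^n,\, a_j^{\mathbf{x}} = 0} P(\mathbf{s} \mid \mathbf{x})\, P(\mathbf{x}) \cdot \frac{a_j^{\mathbf{x}} + s t_j}{n + s}}{\sum_{\mathbf{x} \in \mathcal{X}^n,\, a_j^{\mathbf{x}} = 0} P(\mathbf{s} \mid \mathbf{x})\, P(\mathbf{x})}.$$

Assume for simplicity that, for all $h \neq j$, $t_h \not\to 0$; then $P(\mathbf{x}) > 0$ for all $\mathbf{x}$ with $a_j^{\mathbf{x}} = 0$ and $P(\mathbf{x}) \not\to 0$. Because, with $a_j^{\mathbf{x}} = 0$, we have

$$\lim_{t_j \to 0} \frac{a_j^{\mathbf{x}} + s t_j}{n + s} = \lim_{t_j \to 0} \frac{s t_j}{n + s} = 0,$$

we obtain directly,

$$0 \leq \underline{P}(X_{n+1} = x_j \mid \mathbf{s}) = \inf_{\mathbf{t} \in \mathcal{T}} P(X_{n+1} = x_j \mid \mathbf{s}) \leq \lim_{t_j \to 0} P(X_{n+1} = x_j \mid \mathbf{s}) = 0.$$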



Corollary 8 is a direct consequence of Theorem 7.